The empirical analysis is based on the joint availability of firm-level data from three sources: 1) public US-based firms in Compustat, 2) disambiguated patent assignee data from Kogan et al. (2017), the United States Patent and Trademark Office, and the Fung Institute at UC Berkeley (Balsmeier et al. 2018), and 3) data on R&D tax credits from Wilson (2005). We build firm-level patent portfolios by aggregating eventually granted US patents from 1976 (first year of availability) through 2007 inclusive (the year after the last year of tax credit data). Because some of our measures have no obvious value when a firm does not patent or patents for the first time, some analyses include only firms that applied for at least one patent in a given year and patented at least once in any previous year; a firm’s known classes are calculated from all patents granted to that firm back to 1976. Finally, we restrict the sample to firms that we observe at least twice and that have non-missing values in all control variables.
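
The sample restrictions described above can be sketched as follows. This is only an illustration in plain Python, not the code used for the paper (the actual restrictions are implemented in the Stata do-files listed below). The function name `restrict_sample`, the record keys `firm`, `year`, and `patents`, and the order in which the filters are applied are all assumptions made for this example.

```python
from collections import Counter


def restrict_sample(rows, controls):
    """Apply the sample restrictions to firm-year records.

    rows:     list of dicts with the hypothetical keys 'firm', 'year',
              'patents' (patent applications in that year) and one key
              per control variable.
    controls: names of the control variables that must be non-missing.
    """
    # First year in which each firm applied for at least one patent.
    first_patent_year = {}
    for r in rows:
        if r["patents"] > 0:
            prev = first_patent_year.get(r["firm"])
            if prev is None or r["year"] < prev:
                first_patent_year[r["firm"]] = r["year"]

    kept = []
    for r in rows:
        # At least one patent application in the current year ...
        patents_now = r["patents"] > 0
        # ... and at least one in any earlier year.
        patented_before = first_patent_year.get(r["firm"], r["year"]) < r["year"]
        # Every control variable must be non-missing.
        complete = all(r.get(c) is not None for c in controls)
        if patents_now and patented_before and complete:
            kept.append(r)

    # Keep only firms observed at least twice in the remaining sample
    # (assumed here to be counted after the other filters).
    n_obs = Counter(r["firm"] for r in kept)
    return [r for r in kept if n_obs[r["firm"]] >= 2]
```

Calling `restrict_sample(rows, ["sale"])` on a list of firm-year dictionaries returns the subset of records that pass all three filters. Analyses that do not require prior patenting would skip the first two conditions.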

Results were produced with Stata version 14.

To reproduce the analysis sample and results in the paper you need the following data:

- Compustat data (downloaded on September 18, 2015 through WRDS), variables: emp (Employees), sale (Sales/Turnover (Net)), xrd (Research and Development Expense), gvkey (Global Company Key). Please use your own version, as Compustat does not allow making their data publicly available.

- Consumer price inflation index from the International Monetary Fund: downloaded on November 6, 2018: https://data.imf.org, file: cpi1913_2017.dta, variable: cpi

- R&D tax credit data from Wilson (2007), downloaded on July 21, 2020: https://www.frbsf.org/our-people/economists/daniel-wilson/, dataset: RDusercost.xls

- Patent data1: we start with the extended data till 2019, downloaded on August 9, 2020, from: https://github.com/KPSS2017/Technological-Innovation-Resource-Allocation-and-Growth-Extended-Data. This data provides an updated data series to the CRSP "permno" match following the paper Kogan, L., Papanikolaou, D., Seru, A. and Stoffman, N., 2017. Technological innovation, resource allocation, and growth. Quarterly Journal of Economics, 132(2), pp. 665-712. The paper is available at https://academic.oup.com/qje/article/132/2/665/3076284. We keep only patents for which we have a firm identifier as provided by KPSS. Datasets: KPSS_2019_public.csv + KPSS_2017_public.csv (for fdates)

- Patent data2: Tech class data (uspc) comes from USPTO historical data, downloaded on August 17, 2018 at https://www.uspto.gov/ip-policy/economic-research/research-datasets/historical-patent-data-files. The annual dataset contains counts of in-force and issued patents from 1840 to 2014 by NBER sub-category.  The monthly file contains a monthly count of applications, issued patents, and in-force patents by application status, disposal type (abandoned, issued, or pending), and NBER sub-category from 1981 to 2014.  The monthly_disposal dataset contains counts of application by disposal type for each monthly application cohort by NBER sub-category from 1981 to 2014. The historical_masterfile contains micro-level application, NBER sub-category, and prosecution data on 2.2 million patent applications filed from 1981 to 2014 and 8.9 million patents issued through 2014. Three intermediate files (orders, orders_class, and orders_subclass) used to generate the four datasets are also available for download. A document describing these data is available as: Marco, Alan C. and Carley, Michael and Jackson, Steven and Myers, Amanda F., The USPTO Historical Patent Data Files: Two Centuries of Innovation (June 1, 2015). SSRN working paper, available at http://ssrn.com/abstract=2616724
Dataset: historical_masterfile_short.dta

- Patent data3: patent inventor data comes from Balsmeier et al. 2018, Machine learning and natural language processing on the patent corpus: Data, tools, and new measures, available at: https://onlinelibrary.wiley.com/doi/abs/10.1111/jems.12259, data for download at: https://doi.org/10.7910/DVN/KPMMPV. Dataset: inventor.geo.assignee.combo.disambig.tsv

- Patent data4: data on companies' publications in scientific journals and publications per patent comes from DISCERN: Duke Innovation & Scientific Enterprise Research Network. We used version 5, downloaded on September 15, 2020. Further detailed information and data at: https://zenodo.org/records/3976774, dataset: DISCERN_Panel_Data_1980_2015.dta

- Patent data5: Appropriability data comes from Cohen, Nelson and Walsh (2000), “Protecting Their Intellectual Assets: Appropriability Conditions and Why U.S. Manufacturing Firms Patent (or Not)”, NBER working paper 7552, https://www.nber.org/papers/w7552, Table 1. We merged based on ISIC codes provided in CNW, data: cnw2000t1.dta, variable ‘patents’.

- Patent data6: data on backward science cites of patents comes from Reliance on Science. Further detailed information can be found at: 1. M. Marx & A. Fuegi, "Reliance on Science by Inventors: Hybrid Extraction of In-text Patent-to-Article Citations."  forthcoming in Journal of Economics and Management Strategy. (http://doi.org/10.1111/jems.12455) and 2. M. Marx, & A. Fuegi, "Reliance on Science: Worldwide Front-Page Patent Citations to Scientific Articles" (2020), Strategic Management Journal 41(9):1572-1594. (https://onlinelibrary.wiley.com/doi/full/10.1002/smj.3145) We used version 62, downloaded on December 7, 2023, https://zenodo.org/records/10215169, dataset: ‘_pcs_countsbypatent.csv’ 

- Patent data7: data on backward cites of patents comes from Patentsview, downloaded on September 7, 2020, https://patentsview.org/download/data-download-tables, dataset: ‘uspatentcitation.tsv’ 


 
Do-files:

- rndtax_credits_2024_main_tables_bfsv.do: creates all results in the paper except the within-firm estimations from firm-lab-state data
- rndtax_credits_2024_Table_within_firmstate_bfsv.do: creates all results based on firm-lab-state data
- pr_tx_cr.do: prepares state tax credit data for estimations
- pr_txt_sim.do: prepares text-based patent measure for estimations
- cr_science_cites_comp.do: creates patent citation measures
- cr_patent_measures_wo_adj.do: creates patent measures without adjustment for similarity across tech classes
- cr_pat_measures.do: creates patent measures with adjustment for similarity across tech classes
- cr_patent_measures_pat_level_pub.do: creates patent measures for the merge with inventor data; this do-file must be run after cr_patent_measures.do, as it directly uses the file that do-file writes to MY_TEMP_PATH.
- pr_inv_dta.do: creates lab-state data based on inventor locations
